GOOD, GREAT, OR LUCKY? SCREENING FOR FIRMS WITH
SUSTAINED SUPERIOR PERFORMANCE USING 
HEAVY-TAILED PRIORS 

By Nicholas G. Polson and James G. Scott 

University of Chicago and University of Texas at Austin 

This paper examines historical patterns of ROA (return on as- 
sets) for a cohort of 53,038 publicly traded firms across 93 countries, 
measured over the past 45 years. Our goal is to screen for firms whose 
ROA trajectories suggest that they have systematically outperformed 
their peer groups over time. Such a project faces at least three statisti- 
cal difficulties: adjustment for relevant covariates, massive multiplic- 
ity, and longitudinal dependence. We conclude that, once these dif- 
ficulties are taken into account, demonstrably superior performance 
appears to be quite rare. We compare our findings with other re- 
cent management studies on the same subject, and with the popular 
literature on corporate success. 

Our methodological contribution is to propose a new class of priors 
for use in large-scale simultaneous testing. These priors are based on 
the hypergeometric inverted-beta family, and have two main attrac- 
tive features: heavy tails and computational tractability. The family 
is a four-parameter generalization of the normal/inverted-beta prior, 
and is the natural conjugate prior for shrinkage coefficients in a hier- 
archical normal model. Our results emphasize the usefulness of these 
heavy-tailed priors in large multiple-testing problems, as they have 
a mild rate of tail decay in the marginal likelihood m(y), a property
long recognized to be important in testing. 

1. Introduction. 

1.1. Large-scale screening of historical ROA data. Understanding the 
reasons why some firms thrive and others fail is one of the primary goals of 
research in strategic management. Studies that examine successful compa-
nies to uncover the putative secrets of their success are very popular,
in both the academic and the popular literature.



Received October 2010; revised September 2011. 

Key words and phrases. Corporate benchmarking, type-II beta distribution, multiple 
testing, normal scale mixtures, sparsity. 

This is an electronic reprint of the original article published by the 
Institute of Mathematical Statistics in The Annals of Applied Statistics. 
2012, Vol. 6, No. 1, 161-185. This reprint differs from the original in pagination 
and typographic detail. 






Before the search for special causes can begin, however, success must be 
quantified and benchmarked. This is what our paper tries to do. In keeping 
with prior studies [McGahan and Porter (1999), Wiggins and Ruefli (2005),
Henderson, Raynor and Ahmed (2009)], we use a common metric called 
ROA, or return on assets, to measure a company's success. This quantity 
gives investors some notion of how effectively a firm uses its available funds 
to produce income. It is fundamentally different from a market-based mea- 
sure like stock returns, which may fail to reflect underlying fundamentals 
over long periods of time (e.g., during bubbles), and which exhibit wild fluc- 
tuations that make the identification of trends problematic. Figure 1 shows 
three examples of firm-level ROA trajectories over time; these have been 
standardized using a procedure which we will soon describe. 

In this paper we apply Bayesian methods to historical ROA data, with 
the goal of comparing publicly traded companies against their peers. To be 
sure, ROA is an imperfect measure of corporate success, and our study will 
have the same shortcomings in this regard as any other that uses ROA as 
an outcome variable. One important practical reason for our use of ROA, 
aside from a desire to use the same metric as other researchers studying 
similar questions, is the sheer availability of data on companies from across 
the world (rather than just in the United States). This enables us to screen 
as large a database as possible: 645,456 records from 53,038 companies in 93 
different countries, spanning 1966-2008. In principle, however, our Bayesian 
statistical methodology could be applied to any outcome variable in any 
subpopulation of the corporate universe. 

We conclude that evidence of sustained superior performance is quite rare. 
To reach this conclusion, we use Bayesian models to compute the posterior 
probability that a firm falls into each of two classes: a null class, wherein 
deviations from the peer-group average are attributable to chance; and an 
alternative class, wherein these deviations, both positive and negative, are 
systematic. These posterior probabilities depend upon the particular as- 
sumptions made about the longitudinal persistence of "lucky" performances, 
in a manner soon to be explained. But even under the generous (and unre- 
alistic) assumption of longitudinal independence, we find that there are at 
most 1076 firms over the last 45 years for which there is moderately strong 
evidence of sustained superior performance over 5 years or more. We argue 
that this is a conservative upper bound on the number of such firms, and 
that the actual number is much smaller — our best estimate is 262, or 0.5% 
of all firms, once longitudinal dependence is taken into account. 

1.2. Statistical issues in identifying sustained superior performance. Any 
attempt to benchmark performance, and to identify sustained superior per- 
formers, must deal with at least three statistical challenges. 

Fig. 1. Left: the actual performance of three firms (dots), superimposed on the benchmark
distribution estimated from the Bayesian regression-tree model (black line and grey area, 
showing the posterior mean and 95% predictive interval of expected performance by all firms 
in the corresponding peer group). Right: these same firms placed on a common (normal 
CDF) scale of benchmarked performance, with the integers 0-9 representing the decile. 

First, one must adjust observed performance for relevant covariates. One 
important covariate is a firm's country of operation. Another one is a firm's 
industry; as Henderson, Raynor and Ahmed (2009) observe, some indus- 
tries exhibit structures that are intrinsically more favorable to monopolies, 
which would seem to be a source of advantage unrelated to managerial tal- 
ent or firm-level characteristics. Other potentially important characteristics 
that have been explored in the literature include a firm's size and capital
structure. 

Our method adjusts for the effect of all of these covariates, both on 
the conditional mean and conditional variance of performance. Importantly, 
there is no reason to assume that ROA depends upon them linearly. This is 
quite different from the situation in finance, for example, where the capital- 
asset pricing model (CAPM) and its variants predict a linear dependence 
between firm-level and market-level measures of performance. No such the- 
ory exists that would predict a parallel result for ROA. This means that 
nonlinear relationships must, at least in principle, be allowed. We do this 
using Bayesian treed-regression models, as described in Section 4. 

Second, even "lucky" performance trajectories may exhibit significant lon- 
gitudinal dependencies that lead to spurious declarations of impressiveness. 
Following Denrell (2005), imagine a very simple state-space model, wherein 

$$y_t = a x_t + e_t,$$

$$x_t = b x_{t-1} + u_t,$$

where $y_t$ is an observed performance metric, and $x_t$ is some underlying AR(1)
firm-level characteristic (e.g., resources). Even if there is no systematic com-
ponent of variation in $x_t$, the observed $y_t$'s can still exhibit pronounced
longitudinal autocorrelation, which can look very much like a sustained run 
of excellence. Formally correcting for such autocorrelation would require 
specific parametric models incorporating a wide variety of firm-level effects. 
Instead of taking this route, we try to correct for longitudinal dependence 
in a crude-but-simple fashion by estimating an effective sample size for each 
firm, and adjusting our Bayesian model accordingly. 
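
To see how a purely "lucky" firm can look persistently excellent, the state-space model above can be simulated directly. The following is a minimal sketch, assuming independent standard normal shocks and the illustrative values a = 1 and b = 0.8; none of these choices come from the paper, and the function names are hypothetical.

```python
import random


def simulate_null_trajectory(n_years, a=1.0, b=0.8, seed=0):
    """Simulate y_t = a*x_t + e_t, where x_t = b*x_{t-1} + u_t.

    Assumption (not from the paper): e_t and u_t are independent N(0, 1),
    and the latent state starts at x_0 = 0.
    """
    rng = random.Random(seed)
    x = 0.0
    ys = []
    for _ in range(n_years):
        x = b * x + rng.gauss(0.0, 1.0)          # latent AR(1) characteristic
        ys.append(a * x + rng.gauss(0.0, 1.0))   # observed performance metric
    return ys


def lag1_autocorrelation(ys):
    """Sample lag-one autocorrelation of a sequence."""
    n = len(ys)
    mean = sum(ys) / n
    num = sum((ys[t] - mean) * (ys[t - 1] - mean) for t in range(1, n))
    den = sum((y - mean) ** 2 for y in ys)
    return num / den


if __name__ == "__main__":
    persistent = simulate_null_trajectory(20000, b=0.8)
    memoryless = simulate_null_trajectory(20000, b=0.0)
    print("lag-1 autocorrelation, b = 0.8:", round(lag1_autocorrelation(persistent), 3))
    print("lag-1 autocorrelation, b = 0.0:", round(lag1_autocorrelation(memoryless), 3))
```

Under these assumptions the stationary variance of the latent state is 1/(1 - b^2), so the theoretical lag-one autocorrelation of the observed series is roughly 0.59 when b = 0.8, even though no simulated firm has any systematic advantage.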

Finally, there is the issue of massive multiplicity. Given the large num- 
ber of hypothesis tests being conducted, and the frequentist leanings of the 
management-theory community, maintaining control over false positives is 
crucial. Yet having access to the posterior distribution of effect sizes can 
greatly inform follow-up case studies of individual firms, and is only possi- 
ble under a fully Bayesian model. This applied context makes a combined 
Bayes/frequentist approach especially appealing. 

Our paper's methodological innovation is to introduce a new class of 
heavy-tailed priors for the multiple-testing problem. We first give a brief 
overview of this problem from a Bayesian perspective (Section 2), deferring 
many of the details to the Appendices. We then describe some simulation stud-
ies in Section 3, which are designed to benchmark our proposed method
against reasonable alternatives. In these studies, our methods show excel- 
lent performance in terms of limiting false positives, lending credence to the 
results for the actual data. Finally, we analyze the corporate ROA data in 
Section 4, where we also describe in further detail how we approach the 
other statistical issues we have raised. 

2. Large-scale simultaneous testing.

2.1. Methodological overview. In large-scale simultaneous testing, the 
goal is to uncover lower-dimensional signals from high-dimensional data. 
For example, researchers who use microarrays have long been interested 
in the problem of multiplicity adjustment, where "adjustment" can be un- 
derstood in the sense of adjusting one's tolerance for surprise as the set of 
potentially surprising events grows large. The same issue arises in all modern 
high-throughput experiments; other examples include functional magnetic- 
resonance imaging, environmental sensor networks, combinatorial chemistry, 
and proteomics. Too many type-I errors will mean too many expensive wild- 
goose chases. Hence, the case for a testing procedure that displays good 
frequentist properties is very compelling. 

But so too is the case for a model-based Bayesian procedure. These ex- 
periments may involve thousands of separate tests, and such a large volume 
of data often allows the distributional properties of "signals" and "noise" to 
be characterized quite precisely. 

This paper considers a new version of the two-groups multiple-testing 
model, where we observe data $y_i$ for $i = 1, \ldots, p$ according to a hierarchical
model: 

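$$(y_i \mid \beta_i, \sigma^2) \sim \mathrm{N}(\beta_i, \sigma^2),$$

$$(\beta_i \mid w, \theta) \sim w \cdot g(\beta_i \mid \theta) + (1 - w) \cdot \delta_0,$$

$$w \sim p(w),$$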

a mixture of a Dirac measure at zero, and an alternative model g that is ab- 
solutely continuous with respect to the Lebesgue measure. (The alternative 
model g has hyperparameter $\theta$, presumably also given a prior.) The most
attractive feature of this model is that it automatically adjusts for multiplic-
ity, without the need for ad-hoc regularization. This is because inference for
the $\beta_i$'s will involve the posterior for the common mixing fraction, p(w | y). If
one tests many noise observations in the presence of a few signals, then our 
estimate of w will be small, making it more difficult for all the observations 
to overcome the prior belief in their irrelevance. This exerts a powerful form 
of control over false positives. 
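
The multiplicity adjustment can be made concrete with a small sketch. It assumes unit noise variance and, purely for illustration, takes the alternative g to be a normal density with variance 9; this is a stand-in, not the heavy-tailed alternative proposed in this paper, and the function names are hypothetical.

```python
import math


def normal_pdf(y, var):
    """Density of N(0, var) evaluated at y."""
    return math.exp(-0.5 * y * y / var) / math.sqrt(2.0 * math.pi * var)


def posterior_signal_probability(y, w, sigma2=1.0, tau2=9.0):
    """P(beta_i != 0 | y_i, w) in the two-groups model.

    Assumption for illustration only: g is N(0, tau2), so the marginal
    under the alternative is N(0, sigma2 + tau2).
    """
    m0 = normal_pdf(y, sigma2)           # marginal likelihood under the null
    m1 = normal_pdf(y, sigma2 + tau2)    # marginal likelihood under the alternative
    return w * m1 / (w * m1 + (1.0 - w) * m0)


if __name__ == "__main__":
    # The same observation y = 3 looks less and less like a signal as the
    # mixing fraction w shrinks.
    for w in (0.5, 0.1, 0.01):
        print(w, round(posterior_signal_probability(3.0, w), 3))
```

An observation three standard deviations from zero is a near-certain signal when half of all means are nonzero, but it is probably noise when only one mean in a hundred is nonzero. In the full model w is not fixed but averaged over its posterior p(w | y), so the penalty is learned from the data.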

To handle the multiple-testing problem, we introduce a family of distri- 
butions g based on normal variance mixtures, where the mixing distribution 
is a hypergeometric inverted-beta (HIB) prior:

$$(\beta_i \mid \lambda_i^2, \gamma_i = 1) \sim \mathrm{N}(0, \sigma^2 \lambda_i^2),$$

$$\lambda_i^2 \sim \mathrm{HIB}(a, b, \tau, s),$$

where the indicator $\gamma_i = 1$ if $\beta_i$ is nonzero, and zero otherwise. We approach
these priors from a hybrid Bayesian/frequentist perspective, using them to
compute not only posterior distributions, but also false-discovery rates, or
FDR [Benjamini and Hochberg (1995)]. We also study the behavior of the 
posterior mean, which is competitive with existing gold-standard methods 
[e.g., Johnstone and Silverman (2004)] under squared-error loss. 

In both our data analysis and simulation studies, we focus on three key 
features of our approach: 

(1) The hypergeometric inverted-beta scale mixtures form an especially 
flexible class of symmetric, unimodal densities and can accommodate a wide 
range of tail behavior and behavior near the centering parameter. This class 
simultaneously generalizes the robust priors of Strawderman (1971) and 
Berger (1980), the normal-exponential-gamma prior of Griffin and Brown 
(2005), and the horseshoe prior of Carvalho, Polson and Scott (2010). The
ability of our class to model heavy-tailed distributions with minimal com- 
putational fuss is of particular relevance in testing problems [see, e.g., Sec- 
tion 5.2 of Jeffreys (1961)]. 

(2) Our class of priors allows very easy computation of a wide array of im- 
portant Bayesian and frequentist quantities. This includes posterior means, 
variances, and higher-order moments; posterior null probabilities for indi- 
vidual observations; the score function; false-discovery rates; and local false- 
discovery rates [Efron (2008)]. The ease with which these quantities can be 
computed all relates to the analytical tractability of the marginal likelihood 
function m(y), whose importance we describe in Section 2.2. Appendix A 
provides all the details. 

(3) Our approach yields testing error rates that are competitive with ex- 
isting cutting-edge methods. At the same time, it also retains the advantages 
of a fully Bayesian procedure, in that in principle one has access to the joint 
posterior distribution of all parameters. 

Many of the technical details characterizing the behavior of the basic mix- 
ture model can be found in Scott and Berger (2006) and Bogdan, Chakrabarti 
and Ghosh (2008). These authors assume that the nonzero means follow 
a normal distribution, an assumption we generalize in this paper. Do, Müller
and Tang (2005) also provide an interesting variation wherein the nonzero 
means are modeled nonparametrically using Dirichlet processes. 

The same issues arise in empirical-Bayes analysis. See, for example, John- 
stone and Silverman (2004), Abramovich et al. (2006) and Dahl and Newton 
(2007). Additionally, Müller, Parmigiani and Rice (2007), Bogdan, Ghosh
and Tokdar (2008) and Park and Ghosh (2010) all describe the relationship
between Bayesian multiple testing and classical approaches that control the 
false-discovery rate. 

2.2. The importance of the marginal likelihood function. Many common 
Bayesian and frequentist treatments of the multiple-testing problem can be 
understood through the marginal likelihood functions

$$m_0(y \mid \sigma^2) = \mathrm{N}(y \mid 0, \sigma^2),$$

$$m_1(y \mid \theta) = \int \mathrm{N}(y \mid \beta_i, \sigma^2) \, g(\beta_i \mid \theta) \, d\beta_i,$$

$$m(y \mid \theta, \sigma^2) = w \cdot m_1(y) + (1 - w) \cdot m_0(y).$$

First, following Efron (2008), the local FDR and the posterior probability 
of yi being noise are given by the same expression: 

$$\mathrm{fdr}(y) = P(\beta_i = 0 \mid y, w, \theta) = \frac{(1 - w) \cdot m_0(y)}{m(y)}.$$

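Furthermore, if we let $F_0(y) = \int_{-\infty}^{y} m_0(u) \, du$, $F_1(y) = \int_{-\infty}^{y} m_1(u) \, du$, and
$F(y) = w \cdot F_1(y) + (1 - w) \cdot F_0(y)$, then the FDR is the tail area

$$\mathrm{FDR}(y) = \frac{(1 - w) \cdot F_0(y)}{F(y)}.$$
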
Second, the marginal likelihood function also arises in Masreliez's classic 
representation of the posterior mean. This gives an explicit expression for 
the Bayes estimator for $\beta_i$ under squared-error loss (assuming that $\gamma_i = 1$):

$$E(\beta_i \mid y_i, \gamma_i = 1) = y_i + \frac{d}{dy_i} \ln m_1(y_i),$$

versions of which appear in Masreliez (1975), Polson (1991), Pericchi and
Smith (1992) and Carvalho, Polson and Scott (2010). The choice of alterna-
tive model $g(\beta_i \mid \theta)$ is crucial, insofar as it helps to determine $m_1(y)$.
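
A short worked case shows how the choice of g feeds through the marginal likelihood into the estimator. It is an illustration only, assuming unit noise variance and a normal alternative with variance $\tau^2$, which is not the prior proposed in this paper.

```latex
% Assumptions for illustration: \sigma^2 = 1 and g(\beta_i \mid \theta) = N(\beta_i \mid 0, \tau^2).
\begin{align*}
m_1(y_i) &= \mathrm{N}(y_i \mid 0, 1 + \tau^2), \\
\frac{d}{dy_i} \ln m_1(y_i) &= -\frac{y_i}{1 + \tau^2}, \\
E(\beta_i \mid y_i, \gamma_i = 1) &= y_i - \frac{y_i}{1 + \tau^2}
  = \frac{\tau^2}{1 + \tau^2} \, y_i .
\end{align*}
```

A normal alternative therefore shrinks every observation by the same factor, however large the observation is. By contrast, whenever the score of $m_1$ tends to zero as $|y_i|$ grows, which is the behavior heavy tails are meant to deliver, large observations are left almost unshrunk.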

At the same time, the prior should have desirable statistical properties, 
with flat tails being a particularly important feature. The use of heavy-tailed 
priors for constructing robust shrinkage estimators has a long history, with 
prominent examples to be found in Strawderman (1971) and Berger (1980). 
Jeffreys, meanwhile, observed as early as 1939 that heavy-tailed priors play 
an important role in Bayesian hypothesis testing [see Jeffreys (1961), a later 
edition]. His arguments have been recapitulated in the context of linear 
models by Zellner and Siow (1980) and, more recently, Liang et al. (2008). 

The difficulty is that, while heavy-tailed priors lead to a desirably mild 
rate of tail decay in the marginal likelihood m(y), there are few such priors 
that are also analytically tractable. Any prior that possesses both proper- 
ties, as our proposed family does under certain hyperparameter choices, is 
therefore of great potential interest to Bayesians and non-Bayesians alike. 

We describe the hypergeometric-beta family of priors more fully in a leng- 
thy technical Appendix. But first we present simulation studies that demon- 
strate the usefulness of our approach for limiting false positives, before turn- 
ing to an analysis of the data set at hand. 

3. Simulation studies. As our methodological Appendix shows, hyperge-
ometric inverted-beta scale mixtures of normals are an especially useful class
of priors for building discrete mixture models for $\beta_i$, due to the existence of
simple expressions for moments and marginals under the hypothesis that $\beta_i$
is nonzero:

$$(\beta_i \mid \kappa_i) \sim w \cdot \mathrm{N}(0, \kappa_i^{-1} - 1) + (1 - w) \cdot \delta_0, \tag{3.1}$$

$$\kappa_i \sim \mathrm{HB}(a, b, \tau, s), \tag{3.2}$$

where $\delta_0$ indicates a degenerate distribution at 0. The posterior mean under
this model is a natural estimator for $\beta = (\beta_1, \ldots, \beta_p)$, since it averages over
uncertainty about whether each component is zero or nonzero.

We conducted two simulation studies comparing the mean-squared error 
performance of our estimators with the procedure from Johnstone and Silver-
man (2004), where $\beta$ is estimated by the posterior median under a mixture
of a point mass at zero and a double-exponential (Laplace) prior. We also keep
track of the number of false positives generated by each procedure. 

Each of the two studies involved estimating signals from a different signal 
class. In all cases the dimension of the location vector was p = 1,000. 

Experiment 1: Fixed coefficients. Table 1 summarizes an experiment in- 
volving 12 configurations of different sparsity patterns (5, 50, and 100 nonzero 
means) and different scales (all nonzero means equal to 3, 4, 5, or 7). 

Experiment 2: Random $t$-distributed coefficients. Table 2 summarizes an
experiment in which the nonzero means were randomly drawn from a heavy- 
tailed t distribution with 5 degrees of freedom and scale parameter c. We 
investigated 12 configurations of different sparsity patterns (20, 50, 200, and 
500 nonzero means) and different scales (c = 0.5, 1, 2). 
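
The two designs can be sketched as data generators. The sketch assumes unit-variance Gaussian noise around each mean (the noise variance is not restated in this section), takes the degrees of freedom as an explicit argument, and uses hypothetical function names.

```python
import math
import random


def fixed_signal_data(p, n_nonzero, value, rng):
    """Experiment-1 style data: n_nonzero means equal to `value`, the rest zero."""
    beta = [value] * n_nonzero + [0.0] * (p - n_nonzero)
    y = [b + rng.gauss(0.0, 1.0) for b in beta]  # assumed N(0, 1) noise
    return beta, y


def student_t_draw(df, scale, rng):
    """One draw from a scaled Student-t: scale * Z / sqrt(chi2_df / df)."""
    z = rng.gauss(0.0, 1.0)
    chi2 = rng.gammavariate(df / 2.0, 2.0)  # chi-squared with df degrees of freedom
    return scale * z / math.sqrt(chi2 / df)


def random_signal_data(p, n_nonzero, df, scale, rng):
    """Experiment-2 style data: nonzero means drawn from a scaled t distribution."""
    beta = [student_t_draw(df, scale, rng) for _ in range(n_nonzero)]
    beta += [0.0] * (p - n_nonzero)
    y = [b + rng.gauss(0.0, 1.0) for b in beta]  # assumed N(0, 1) noise
    return beta, y


if __name__ == "__main__":
    rng = random.Random(1)
    beta, y = fixed_signal_data(p=1000, n_nonzero=50, value=3.0, rng=rng)
    print(len(y), sum(1 for b in beta if b != 0.0))
```

Placing the nonzero means first is a convenience; because the model treats the components as exchangeable, their order carries no information.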

Tables 1 and 2 show the average sum of squared errors in estimating $\beta$
over 100 independent data sets. Also shown are the average number of false
positives declared by the two procedures in each case, and the average false-
discovery rate. For the Johnstone/Silverman procedure, a false positive oc-
curs when the posterior median of $\beta_i$ is nonzero, but the actual value is
zero. For the Bayesian procedure using the hypergeometric inverted-beta
prior, a false positive occurs when the posterior inclusion probability for $\beta_i$
is greater than 50% and $\beta_i$ is actually zero. This threshold reflects a 0-1 loss
function that penalizes false positives and false negatives equally, regardless 
of size. A full decision-theoretic analysis incorporating more realistic loss 
functions would yield a different, data-adaptive threshold, but would only 
complicate the analysis slightly. 
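
The three reported quantities can be computed for a single simulated data set as follows. This is a minimal sketch with hypothetical function names; it takes the estimates and posterior inclusion probabilities as given and does not implement either estimation procedure.

```python
def declare_signals(inclusion_probs, threshold=0.5):
    """Declare a signal when the posterior inclusion probability exceeds the threshold."""
    return [p > threshold for p in inclusion_probs]


def evaluate(beta_true, beta_hat, declared):
    """Return (SSE, number of false positives, realized false-discovery rate)."""
    sse = sum((bh - bt) ** 2 for bh, bt in zip(beta_hat, beta_true))
    false_pos = sum(1 for d, bt in zip(declared, beta_true) if d and bt == 0.0)
    n_declared = sum(1 for d in declared if d)
    # Convention assumed here: the realized FDR is 0 when nothing is declared.
    fdr = false_pos / n_declared if n_declared else 0.0
    return sse, false_pos, fdr


if __name__ == "__main__":
    beta_true = [0.0, 0.0, 3.0, 3.0]
    beta_hat = [0.0, 1.0, 2.5, 0.0]
    declared = declare_signals([0.1, 0.7, 0.9, 0.2])
    print(evaluate(beta_true, beta_hat, declared))
```

Averaging these three numbers over repeated data sets gives quantities of the kind reported in the tables.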

For the hypergeometric inverted-beta prior, we set s = 0, while w and $\tau$
were estimated by importance sampling. For priors, we assumed that
$\tau \sim \mathrm{C}^{+}(0, \sigma)$, and that $w \sim \mathrm{Unif}(0, 1)$.






Table 1

Experiment 1, fixed coefficients. SSE: sum of squared errors in the estimate of the $\beta$
sequence. FP: false positive declarations in the estimate of the $\beta$ sequence. FDR: realized
false-discovery rate. Laplace: posterior median estimator from the empirical Bayes
procedure of Johnstone and Silverman (2004). The numbers in parentheses indicate, in
order, the choices of a and b for the HIB model. Column groups give the number of
nonzero means out of 1,000; columns within each group give the value of the nonzero means.

Nonzero means:              5                           50                         100
Value:               3      4      5      7      3      4      5      7      3      4      5      7

SSE  Laplace      35.1   32.8   17.9    8.5  210.5  150.8   99.7   71.9  331.1  248.3  177.5  142.9
     (1,2)        35.4   31.9   17.9   10.3  205.4  157.7  116.7   90.6  334.6  268.2  213.2  180.4
     (1,1)        35.0   31.3   18.5   11.1  200.5  161.9  124.7   95.3  329.1  280.8  229.3  188.5
     (1,0.5)      34.7   31.0   19.6   12.2  199.6  170.7  135.3  100.6  335.2  302.2  248.1  196.3
     (0.5,2)      37.9   36.8   18.3    7.3  242.6  167.3  104.0   70.8  395.3  272.8  182.8  145.7
     (0.5,1)      37.6   36.3   18.1    7.6  234.9  164.1  105.0   72.6  379.5  268.8  186.4  148.9
     (0.5,0.5)    37.4   35.7   17.9    7.9  227.5  161.1  106.2   74.2  363.6  266.2  190.9  151.9

FP   Laplace       0.8    1.0    0.8    0.4   16.1   11.3    7.6    4.2   53.3   28.7     17    8.9
     (1,2)         0.2    0.3    0.6    0.5    4.0    6.9    6.6    5.5   12.2   18.2   17.2   13.2
     (1,1)         0.2    0.4    0.7    0.5    6.4   10.2    9.4    6.9   23.7   34.0   29.2   18.7
     (1,0.5)       0.3    0.6    0.8    0.7   13.5   21.1   16.9    9.8  153.5  199.8   90.0   33.4
     (0.5,2)       0.1    0.1    0.1    0.2    1.1    2.5    2.2    2.2    2.9    5.5    5.4    5.1
     (0.5,1)       0.1    0.1    0.2    0.2    1.4    3.0    2.7    2.5    3.7    7.1    6.7    5.9
     (0.5,0.5)     0.1    0.1    0.2    0.2    1.7    3.7    3.1    2.8    5.5    9.5    8.6    6.8

FDR  Laplace       0.2    0.2    0.1    0.1    0.3    0.2    0.1    0.1    0.4    0.2    0.1    0.1
     (1,2)         0.1    0.1    0.1    0.1    0.1    0.1    0.1    0.1    0.1    0.2    0.1    0.1
     (1,1)         0.1    0.1    0.1    0.1    0.2    0.2    0.2    0.1    0.2    0.3    0.2    0.2
     (1,0.5)       0.2    0.1    0.1    0.1    0.3    0.3    0.2    0.2    0.6    0.6    0.5    0.2
     (0.5,2)       0.1    0.0    0.0    0.0    0.0    0.1    0.0    0.0    0.1    0.1    0.1    0.0
     (0.5,1)       0.1    0.0    0.0    0.0    0.1    0.1    0.1    0.0    0.1    0.1    0.1    0.1
     (0.5,0.5)     0.1    0.0    0.0    0.0    0.1    0.1    0.1    0.1    0.1    0.1    0.1    0.1



In experiment 1, we used a range of values for a and b. The best overall 
choice seemed to be a = 1/2, b = 1, and so we focused solely on this choice 
in experiment 2. Indeed, although certain alternative choices produced im- 
provements in specific situations, we found a = 1/2, b = 1 to be a good all- 
purpose option because of its blend of good performance in estimation and 
testing. 

Overall, when squared error in estimation is used to decide between pro- 
cedures, our preferred Bayes procedure with a = 1/2, b = 1 wins slightly on 
experiment 2, while the empirical-Bayes thresholding procedure wins slightly 
on experiment 1. We attribute these differences to the relative tail weight of 
the two priors. The double-exponential prior has tails that are heavier than 
the Gaussian likelihood, but not as heavy as those of the hypergeometric 

Table 2

Experiment 2, random coefficients. The HIB prior uses a = 1/2, b = 1. Column groups
give the number of nonzero means; columns within each group give the scale c.

Nonzero means:          50                   100                  200                  500
Scale c:           0.5      1      2    0.5      1      2    0.5      1      2    0.5      1      2

SSE  HIB           8.3   16.0   55.4   28.8   53.2    125   90.2    235    336    181    391    604
     Laplace       8.6   16.1   60.4   29.5   57.3    136   93.1    250    370    180    394    646

FP   HIB           0.0    0.0    0.2    0.0    0.2    0.5    0.1    0.8    3.3    0.1    0.9   10.8
     Laplace       0.4    3.7    1.1    2.3    1.6    3.2   23.5     34   19.1    138    134   71.5

inverted-beta priors we studied. This difference in tail weight becomes much
more significant in the experiment with random coefficients, since draws
from a $t$ density produce some very large signals, much larger than sig-
nals of size 7 in the "fixed coefficients" study. In experiment 1, however,
the heavier-tailed priors are wasting some of their mass in areas of the pa-
rameter space far from the origin. Since these areas are predestined to be
unimportant by the particular choices of fixed signals, it is no surprise that
a lighter-tailed prior such as the double-exponential will yield superior re-
sults. Similarly, when the coefficients are slightly larger, as in the $t$ signals
from experiment 2, the heavier-tailed prior will outperform.

But when the measuring stick is the false-positive rate, the fully Bayes 
procedure with smaller values of a and b wins. It produces far fewer false 
positives across the board, along with lower false-discovery rates (suggest- 
ing that it is not merely more conservative across the board in declaring an 
observation to be a signal). It therefore seems like the more robust choice. 
For situations when estimation is the goal, its performance is roughly com- 
parable to the existing Johnstone/Silverman procedure. Yet for situations 
when testing is the goal, the Bayes procedure appears more trustworthy. 

4. Testing for superior historical performance. 

4.1. Data preprocessing. Before applying our multiple-testing method, 
we preprocessed the data as follows. Let $y_{it}$ be the raw data point for com-
pany $i$ in year $t$. We first standardized the data to have zero mean and unit
variance across all countries and years. Using Bayesian treed-regression soft-
ware [Gramacy and Lee (2008)], we then estimated a conditional mean $m_{it}$
and a conditional standard deviation $s_{it}$, representing the expected distribu-
tion of performance for other firms in company $i$'s peer group in year $t$. As
covariates, we used a company's industry, size, leverage, country of opera-
tion, and market share. For an extensive discussion of how this issue relates
to the disambiguation of so-called "Schumpeterian" rents from "monopolis-
tic" rents, see Henderson, Raynor and Ahmed (2009).

The regression-tree approach allows us to account for the highly non- 
linear, conditionally heteroskedastic relationships present in the data. An 
instructive comparison can be found in Figure 1, which shows three firms: 
JPMorgan Chase, IBM, and Gap Instrument Corporation. It is clear that 
the three firms have noticeably different peer-group means, and drastically 
different peer-group standard deviations. The left-hand plots show the actual 
performance, along with the "benchmark distribution" — that is, the mean 
and standard deviation of that year's expected performance, given firm-level 
covariates. The right-hand plots show the performance with respect to the 
benchmark distribution, all on a common normal-CDF scale. Supplemental 
files available upon request from the authors show the results of an extensive 
exploratory analysis of ROA versus important covariates, and substantiate
our claim that nonlinear, conditionally heteroskedastic regression is essential 
here. 

We then computed a z-score $z_{it} = (y_{it} - m_{it})/s_{it}$ for each company-year
data point. We emphasize that the term $m_{it}$ accounts only for the effects
of covariates, and does not include a random effect specific to the firm in
question. Therefore, if firm $i$ systematically performs $\mu_i$ standard deviations
above (or below) its peer-group mean, and each year's performance is con-
ditionally independent given $\mu_i$, then

$$(z_{it} \mid \mu_i) \sim \mathrm{N}(\mu_i, 1) \quad \text{for } t = 1, \ldots, n_i.$$

If $\mu_i = 0$, then the sample mean $\bar{z}_i$ of the $z_{it}$'s for firm $i$ is normally distributed
with mean 0 and variance $1/n_i$, where $n_i$ is the number of observations
we have for that firm (ranging from 5 to 43). This is our preliminary null
hypothesis. Stated in an equivalent form,

$$z_i = \bar{z}_i \sqrt{n_i} \sim \mathrm{N}(0, 1).$$

These z-scores are the raw inputs to our multiple-testing approach. Based 
on the simulation results above, we are reporting results for a = 1/2, b = 1, 
which seemed to provide the best overall results in terms of testing. 
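
The preprocessing step for a single firm can be sketched as follows. The peer-group means and standard deviations are taken as given inputs (in the paper they come from the treed-regression fit, which is not reproduced here), and the function name is hypothetical.

```python
import math


def firm_z_score(y, m, s):
    """Firm-level test statistic: sqrt(n) times the mean of the yearly z-scores.

    y: observed yearly performance for one firm
    m: peer-group conditional means for the same years (assumed given)
    s: peer-group conditional standard deviations (assumed given)
    """
    yearly = [(yt - mt) / st for yt, mt, st in zip(y, m, s)]
    n = len(yearly)
    z_bar = sum(yearly) / n
    return z_bar * math.sqrt(n)


if __name__ == "__main__":
    # Four years of data; the yearly z-scores are 1, 3, 1 and 1.
    print(firm_z_score([2.0, 4.0, 6.0, 8.0], [1.0, 1.0, 1.0, 1.0], [1.0, 1.0, 5.0, 7.0]))
```

Scaling the mean by the square root of the number of years puts firms with short and long histories on the same standard normal scale under the null hypothesis.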

4.2. Summary of results. We ran the proposed multiple-testing method 
on the cohort of firms for which at least 5 years of past data were available. 
This initial sieve left us with a cohort of 37,014 firms, each with somewhere 
between 5 and 43 annual observations. 

Of the tested cohort, 1,076 firms (or about 3%) had posterior probabilities 
of outperformance larger than 90%, indicating moderate to high confidence 
that they have systematically outperformed their peer groups. For this co- 
hort, the expected group-wise false discovery rate (FDR) is 2%; this can be 

Table 3

Ten firms with the highest posterior probabilities of having a nonzero mean

Company               Description                                              Books

Alfacell Corporation  A biotech firm specializing in RNA-based technologies.
Wyeth                 Large drug company; recently bought out by Pfizer.
American List Corp    Bulk mailing firm. Bought out in 1997.
Deluxe Corp           Financial and logistical services for small businesses.
Tambrands             Personal hygiene products. Bought out in 1997.
Toth Aluminum         Developed aluminum technology. Defunct.
UST                   A tobacco holding company. Bought out in 2009.
WD-40                 Manufactures the anticorrosive and lubricating agent.
Landauer              Specializes in services relating to radiation safety.
Merck                 Large drug company.                                      BTL, ISE

computed by simply averaging the posterior probabilities that each firm in 
the cohort comes from the null model. An additional 705 firms had posterior 
probabilities of outperformance between 50% and 90%. For this intermediate 
group, the expected FDR is 28%. 
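
The expected group-wise FDR described above is simply an average of posterior null probabilities. A minimal sketch follows, with a hypothetical function name and made-up probabilities used only to show the arithmetic.

```python
def expected_group_fdr(inclusion_probs, lower, upper=1.0):
    """Average posterior null probability among firms with lower < p <= upper.

    inclusion_probs: posterior probabilities that each firm has a nonzero mean.
    Returns None when no firm falls in the requested band.
    """
    group = [p for p in inclusion_probs if lower < p <= upper]
    if not group:
        return None
    return sum(1.0 - p for p in group) / len(group)


if __name__ == "__main__":
    probs = [0.99, 0.95, 0.70, 0.60, 0.20]   # illustrative values, not from the data
    print(expected_group_fdr(probs, 0.9))        # firms above 90%
    print(expected_group_fdr(probs, 0.5, 0.9))   # firms between 50% and 90%
```

Because each firm's null probability is at most one minus the lower cutoff of its band, the expected FDR for the group above 90% can never exceed 10%.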

The top 10 overall firms ranked by posterior probability are described in 
Table 3, along with the reason that firm dropped out of the database (if 
applicable). Of these 10 firms, 8 seemed to outperform their peer group, 
while 2 seemed to underperform. The first non-American firm on the list is
British-American Tobacco, incorporated in (of all places) Malaysia, which 
ranks 11th by estimated posterior inclusion probability. 

The historical trajectories for these 10 firms can be seen in Figure 2. Two 
are large drug companies; the rest come from a variety of different industries. 
All but four — Wyeth, Merck, Tambrands, and WD-40 — are likely unknown 
to the average consumer. 

These results are best thought of as a reasonable upper bound to the 
actual number of sustained superior performers. This is true for at least two 
reasons. First, although we used all data for 53,038 firms to fit the regression 
tree models and compute $m_{it}$ and $s_{it}$, we did not conduct hypothesis tests
for the 16,024 firms with fewer than 5 years of data. It is difficult to know what
"long-term superiority" even means for this vast group of firms with so short 
a history. Moreover, their presence in the testing stage of the analysis would 
likely bias the estimate of w (the prior inclusion probability) downward, 
because the Bayes factor so strongly favors the null hypothesis for such 
a short trajectory. (This results from the well-known Bayesian "Occam's 
razor" effect that arises when comparing models of different dimensionality.) 
This introduces a possible survivorship bias into our procedure. But given 
the assumption of exchangeability in our model, we believe that the effects of 
survivorship bias are less severe than the likely effects of watering down the 
cohort with so many firms for which the null hypothesis is so likely a priori. 

Fig. 2. The performance trajectories for the ten firms with the highest posterior proba-
bilities of having a nonzero mean.

Second, and more importantly, our analysis assumes that a company's
ROA result in year t is independent of results from previous years, given 
the peer group mean and standard deviation. This is unlikely to be exactly 
true, and therefore introduces an upward bias in our estimate of the number 
of superior performers (due to the fact that autocorrelation reduces the 
effective sample size available for testing $H_0$).

One way of accounting for this bias is to introduce specific parametric 
assumptions about the nature of a "true null" trajectory. Indeed, this is an 
active and promising area of research in both this and in parallel fields (e.g., 
time-course microarray data). Our focus in this paper, however, is on
large-scale screening with relatively few assumptions. We therefore eschew explicit
parametric longitudinal models and adopt the following alternative strategy 
in an attempt to get a fast, crude assessment of how the independence 
assumption may affect our results: 

(1) For each firm in the testing cohort, we estimate a one-lag
autocorrelation coefficient, $\hat{\phi}_i$. For the handful of firms for which
this estimate is negative, we threshold at zero, since we do not wish to
introduce negative correlation into the sampling distribution for the data.

(2) We compute an effective sample size for each trajectory as

$$n_i^\star = n_i \, \frac{1 - \hat{\phi}_i}{1 + \hat{\phi}_i},$$

using the well-known correction for autocorrelation. While this is motivated
by simple AR(1)-type null models, one may interpret the multiplicative term
involving $\hat{\phi}_i$ purely as a deflator, corresponding to the reduction in
information in each longitudinal sample compared to the i.i.d. case.

(3) We recompute the z-score as $z_i^\star = z_i \sqrt{n_i^\star / n_i}$.

We then repeat the testing procedure using the $z_i^\star$ as data, which has the
effect of inflating the variance under the null hypothesis. This correction
led to 262 firms with a posterior probability greater than 90% (expected 
FDR for the group: 2%), and an additional 222 with a posterior probability 
between 50% and 90% (expected FDR for the group: 26%). The top 10 firms 
remained unchanged, except for Toth Aluminum and Alfacell. 
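
Steps (1)-(3) above can be sketched in a few lines of Python. This is only an illustrative sketch: the function names, and the assumption that each firm's trajectory is supplied as a list of standardized yearly results together with its original z-score, are ours and not part of the original analysis.

```python
import math

def lag1_autocorrelation(x):
    """Sample lag-one autocorrelation of a sequence (needs len(x) >= 2)."""
    n = len(x)
    mean = sum(x) / n
    denom = sum((v - mean) ** 2 for v in x)
    if denom == 0.0:
        return 0.0
    num = sum((x[t] - mean) * (x[t - 1] - mean) for t in range(1, n))
    return num / denom

def adjusted_z(z, x):
    """Deflate a z-score using the AR(1)-style effective sample size."""
    n = len(x)
    phi = max(0.0, lag1_autocorrelation(x))      # step (1): threshold at zero
    n_eff = n * (1.0 - phi) / (1.0 + phi)        # step (2): effective sample size
    return z * math.sqrt(n_eff / n), phi, n_eff  # step (3): rescaled z-score
```

A trajectory with no positive autocorrelation is left untouched, while a strongly trending one has its z-score pulled toward zero.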

Our results appear to be qualitatively similar to those of Henderson, 
Raynor and Ahmed (2009), who use essentially the same data. But we will 
point to two important methodological differences that likely account for 
any major divergence in testing outcomes. First, we use model-averaged 
estimates from Bayesian treed regression to estimate a conditional mean 
and standard deviation for every company in every year. In contrast, Hen- 
derson, Raynor and Ahmed (2009) use linear quantile regression, which is 
a fundamentally different (and arguably less flexible) way of accounting for

Table 4
The popular books selected for comparison

Title                     Published   Selection method                                                        Basis
Good to Great             2001        Companies from 1965-1981 selected on the basis of shareholder return    Quantitative
Built to Last             1994        Companies founded before 1950 that met certain success criteria         Qualitative
In Search of Excellence   1982        Surveys of executives at author-selected firms                          Qualitative
Competitive Strategy      1980        Author selected examples to support theory; method unclear              Qualitative
Hidden Values             2000        Author selected examples to support theory; method unclear              Qualitative
Blueprint to a Billion    2006        Time to achieve $1 billion in revenue after initial public offering     Quantitative
What Really Works         2003        Correspondence with prespecified "top management practices"             Qualitative
Stall Points              2008        Patterns of stalls and recovery in revenue growth                       Quantitative
Blue Ocean Strategy       2005        Author selected examples to support theory; method unclear              Qualitative

conditional heteroskedasticity (which appears to be the dominant effect of 
covariates). Second, we adjust each company's longitudinal results individu- 
ally to account for firm-level heterogeneity with respect to autocorrelation. 
In contrast, Henderson, Raynor and Ahmed (2009) account for longitudi- 
nal dependence by assuming that the same semi-parametric Markov model 
holds across the entire population of "lucky" firms. 

4.3. Comparison with the popular literature on corporate success. As 
a small aside, it is interesting to compare these results to the conclusions 
of a handful of well-known books that purport to explain corporate suc- 
cess. We took a small, nonscientific sample of these books, in an attempt 
to gauge whether the results from the multiple-testing model correspond to 
widely held notions about successful firms. Table 4 briefly describes these 
books, and indicates whether the basis for selecting the study cohort was 
qualitative or quantitative in nature. The books were chosen in conjunction 
with a group of senior management consultants at Deloitte Consulting, who 
judged the list to be fairly representative of the popular literature. 

These books follow a common recipe: start with a group of companies; 
identify the "successful" ones; look for patterns in their behavior; and ab- 
stract those behaviors into a small set of principles that can tell others how 
to run their businesses better. One important difference between these books 
and the approach considered here is the choice of outcome variable. In some
books the outcome variable is multidimensional, and therefore richer than
our choice of ROA. Thus, while comparisons are instructive, they do not 
support the conclusion that our study is objectively right and the others 
wrong. Moreover, as a referee observed, the authors of these books may 
have different things in mind when they define success. 

Yet, collectively, these studies exhibit many unacknowledged sources of 
bias, which our study attempts to address. None, for example, make a se- 
rious attempt to verify statistically that the selected companies have done 
anything special when compared with a suitable reference population. This 
opens up the possibility that they have been studying companies that were 
lucky, rather than great — the precise null hypothesis considered in this pa- 
per. There are also serious issues with selection bias — both in terms of metric 
selection and of company selection — and of survivorship bias (although our 
study is also imperfect in this regard).

Perhaps for these reasons, serious discrepancies emerged between the
popular literature and the conclusions of the multiple-testing procedure
considered here. Across the nine books considered, there were 209 distinct firms
that were used as case studies — some positive, some negative — and that also 
appeared in our cohort of firms with 5 or more years of data. Of the top 
ten firms flagged in the previous section, only one was mentioned in any of 
the 9 books: Merck, a case study in Built to Last (BTL) and In Search of 
Excellence (ISE). Of the 209 firms collectively mentioned in these books, 
only 9 appear on our list of firms with ROA trajectories significantly better 
than those of their peer groups, once longitudinal dependence is accounted 
for. 

5. Final remarks. We have developed a Bayesian multiple-testing pro- 
cedure based upon a heavy-tailed prior for the nonzero means. These priors 
form an interesting, novel class of normal variance mixtures, the hyperge- 
ometric inverted-beta class. Overall, the procedure has the nice theoretical 
property of a redescending score function under the alternative model, and 
seems to perform as well as, or better than, existing gold-standard meth- 
ods. Moreover, it allows relevant Bayesian and frequentist summaries to be 
computed with minimal computational fuss. This property arises from the 
simple, known form of the marginal distribution m(y). 

We have applied the method to a large data set on historical corporate 
performance, and compared the results of our analysis to some popular books 
that deal with the same subject. These books appear to be studying a sample 
where the large majority of firms have ROA performance profiles that are 
statistically indistinguishable from luck. Meanwhile, there are on the order of
hundreds of firms (out of a group of over 37,000) whose performance is at 
least suggestive of a sustained advantage, and yet were not considered in 
these high-profile case studies.

APPENDIX A: THE PROPOSED FAMILY OF PRIORS

A.1. Connection with classical shrinkage rules. Our new class of priors
has its genesis in the large body of work on classical shrinkage rules, where
a multivariate normal prior $\beta \sim N(0, \lambda^2 I)$ is assumed, where
$\beta = (\beta_1, \ldots, \beta_p)$. Many common estimators for this problem, both
Bayesian and non-Bayesian, are of the form $\hat{\beta}(y) = \{1 - g(Z)\} y$ for
$Z = \|y\|^2$ [e.g., James and Stein (1961), Strawderman (1971), Stein (1981),
Fourdrinier, Strawderman and Wells (1998)]. The central issue is how to
identify "nice" functions $g(Z)$, and how to understand priors for global
variance components in terms of the behavior of the estimators they yield.
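
As a concrete instance of the form $\{1 - g(Z)\} y$, the positive-part James-Stein rule takes $g(Z) = \min\{1, (p - 2)/Z\}$. The sketch below assumes unit sampling variance and $p \geq 3$; the function name is ours.

```python
def positive_part_james_stein(y):
    """Shrink y toward the origin by 1 - g(Z), with g(Z) = min(1, (p - 2) / Z)."""
    p = len(y)
    z = sum(v * v for v in y)      # Z = ||y||^2
    if z == 0.0:
        return [0.0] * p
    g = min(1.0, (p - 2) / z)      # capping g at 1 gives the positive-part rule
    return [(1.0 - g) * v for v in y]
```

For large $Z$ the factor $g(Z)$ vanishes and the observation is left nearly unshrunk, while for small $Z$ the whole vector is set to zero.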

The constraint to rationality (that is, the requirement that there exists
a prior $p(\kappa)$ such that, for all $Z$, $g(Z) = E(\kappa \mid Z)$ under the
posterior $p(\kappa \mid Z)$) rules out a wide class of potential estimators. The
function $g(Z)$ cannot, for example, be a polynomial of order two or greater.
Indeed, the functional form of a $g(Z)$ that respects admissibility will
typically be quite complicated.

It is natural to look in the class of estimators where $g(Z) = p(Z)/q(Z)$,
a ratio of power-series expansions. One can construct such a $g(Z)$ by
assuming that $(\beta \mid \lambda^2) \sim N(0, \lambda^2 I)$, and then defining
$\hat{\beta}(\lambda^2) = E(\beta \mid \lambda^2, y)$. After removing the dependence
upon $\lambda^2$ by marginalizing, this leads to

$$\hat{\beta} = E_{\lambda^2 \mid y}\{\hat{\beta}(\lambda^2)\} = \{1 - E(\kappa \mid Z)\} y,$$

recalling that $\kappa = 1/(1 + \lambda^2)$. We can therefore identify $g(Z)$ with
$E(\kappa \mid Z)$, the posterior expectation of $\kappa$, given $Z$.

One can define a class of priors for $\kappa$ indexed by $(a, b, \tau, s)$, which we
call the hypergeometric inverted-beta class, such that

$$g(Z) = E(\kappa \mid Z) = \frac{a + p/2}{a + b + p/2} \, \frac{\Phi_1(b, 1; a + b + p/2 + 1; s + Z/2, 1 - 1/\tau^2)}{\Phi_1(b, 1; a + b + p/2; s + Z/2, 1 - 1/\tau^2)}, \tag{A.1}$$

where $a$, $b$, and $\tau$ are positive real numbers; $s$ is any real number; and
$\Phi_1$ is the degenerate hypergeometric function of two variables [Gradshteyn
and Ryzhik (1965), Equations 9.261.1-9.261.3].

This $g$ is a ratio of power series, and can be computed quite rapidly for
a given tuple $(a, b, \tau, s)$ and a given $Z$. It leads to a large class of admissible
estimators with a wide range of possible behavior. In particular, it includes 
many estimators that exhibit robustness to large values of Z; many estima- 
tors that offer significant risk reduction near Z = 0; and many that do both. 
This class generalizes the form noted by Maruyama (1999), which contains 
the positive-part James-Stein estimator as a limiting (improper) case. 

A.2. Hypergeometric inverted-beta priors. The connection with multiple
testing is as follows. Recall that under the alternative model, $\beta_i$ is
conditionally normal with variance $\lambda_i^2$. Our approach is to work with the
transformed variable $\kappa_i = 1/(1 + \lambda_i^2)$, and to define the following
prior for $\kappa_i$. Suppressing subscripts for the moment,

$$p(\kappa) = C^{-1} \kappa^{a-1} (1 - \kappa)^{b-1} \left\{ \frac{1}{\tau^2} + \left(1 - \frac{1}{\tau^2}\right) \kappa \right\}^{-1} \exp(-s\kappa), \tag{A.2}$$

where $a, b, \tau > 0$ and $s \in \mathbb{R}$, and where $C$ is a constant of proportionality.
We denote the hypergeometric-beta prior on the $\kappa$ scale by $\kappa \sim \mathrm{HB}(a, b, \tau, s)$.
The normalizing constant

$$C = \int_0^1 \kappa^{a-1} (1 - \kappa)^{b-1} \left\{ \frac{1}{\tau^2} + \left(1 - \frac{1}{\tau^2}\right) \kappa \right\}^{-1} \exp(-s\kappa) \, d\kappa \tag{A.3}$$

can be computed using hypergeometric series. Using the theory laid out in
Gordy (1998) and Polson and Scott (2011), we get

$$C = e^{-s} \, \mathrm{Be}(a, b) \, \Phi_1(b, 1, a + b, s, 1 - 1/\tau^2), \tag{A.4}$$

where $\Phi_1$ is the degenerate hypergeometric function of two variables
[Gradshteyn and Ryzhik (1965), 9.261]. This function can be calculated accurately
and rapidly by transforming it into a convergent series of ${}_2F_1$ functions
[Section 9.2 of Gradshteyn and Ryzhik (1965), Gordy (1998)], making evaluation
of (A.4) quite fast for most allowable choices of the parameters.
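
The agreement between the integral (A.3) and the closed form (A.4) can be checked numerically. The sketch below is illustrative only. It evaluates $\Phi_1$ by its direct double power series, which converges only when $|1 - 1/\tau^2| < 1$, rather than by the ${}_2F_1$ transformation mentioned above. The quadrature assumes $a, b \geq 1$, so that the integrand has no endpoint singularities. All function names are ours.

```python
import math

def phi1(alpha, gamma, x, y, terms=400):
    """Series for Phi_1(alpha, 1; gamma; x, y), assuming |y| < 1.

    Uses sum_k (alpha)_k / (gamma)_k * c_k with c_k = sum_{m+n=k} x^m / m! * y^n,
    which satisfies the recurrence c_k = y * c_{k-1} + x^k / k!.
    """
    ratio, x_term, c, total = 1.0, 1.0, 1.0, 1.0
    for k in range(1, terms):
        ratio *= (alpha + k - 1) / (gamma + k - 1)
        x_term *= x / k
        c = y * c + x_term
        total += ratio * c
    return total

def beta_fn(p, q):
    """Beta function Be(p, q)."""
    return math.gamma(p) * math.gamma(q) / math.gamma(p + q)

def hb_constant_series(a, b, tau, s):
    """Normalizing constant C from the closed form (A.4)."""
    return math.exp(-s) * beta_fn(a, b) * phi1(b, a + b, s, 1.0 - 1.0 / tau ** 2)

def hb_constant_quadrature(a, b, tau, s, n=2000):
    """The same constant from the integral (A.3), by composite Simpson's rule."""
    def f(k):
        return (k ** (a - 1) * (1 - k) ** (b - 1) * math.exp(-s * k)
                / (1 / tau ** 2 + (1 - 1 / tau ** 2) * k))
    h = 1.0 / n
    total = f(0.0) + f(1.0)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * f(i * h)
    return total * h / 3

series = hb_constant_series(2.0, 3.0, 1.5, 0.5)
quad = hb_constant_quadrature(2.0, 3.0, 1.5, 0.5)
```

For values of $\tau$ outside the convergence region of the direct series, a transformed representation such as the one described in Gordy (1998) would be needed instead.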
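The implied density for $\lambda^2$ takes the form

$$p(\lambda^2) = C^{-1} (\lambda^2)^{b-1} (\lambda^2 + 1)^{-(a+b)} \exp\left\{ -\frac{s}{1 + \lambda^2} \right\} \left\{ \frac{1}{\tau^2} + \frac{1 - 1/\tau^2}{1 + \lambda^2} \right\}^{-1}. \tag{A.5}$$

This is a generalization of the inverted-beta distribution, also known as
Pearson's type VI distribution. Indeed, it reduces to an inverted beta in the
special case where $s = 0$, $\tau = 1$, in which case $a\lambda^2/b$ will follow an
$F(2b, 2a)$ density.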
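
The density on the $\lambda^2$ scale follows from (A.2) by a change of variables. The working below is a sketch that assumes only (A.2) and the map $\kappa = 1/(1 + \lambda^2)$; it is not reproduced from the original derivation.

```latex
\begin{align*}
\kappa &= \frac{1}{1+\lambda^2}, \qquad
1-\kappa = \frac{\lambda^2}{1+\lambda^2}, \qquad
\left|\frac{d\kappa}{d\lambda^2}\right| = (1+\lambda^2)^{-2}, \\
p(\lambda^2)
 &= C^{-1} (1+\lambda^2)^{-(a-1)}
    \left(\frac{\lambda^2}{1+\lambda^2}\right)^{b-1}
    \left\{\frac{1}{\tau^2} + \left(1-\frac{1}{\tau^2}\right)\frac{1}{1+\lambda^2}\right\}^{-1}
    \exp\left(-\frac{s}{1+\lambda^2}\right) (1+\lambda^2)^{-2} \\
 &= C^{-1} (\lambda^2)^{b-1} (1+\lambda^2)^{-(a+b)}
    \exp\left(-\frac{s}{1+\lambda^2}\right)
    \left\{\frac{1}{\tau^2} + \frac{1-1/\tau^2}{1+\lambda^2}\right\}^{-1}.
\end{align*}
```

Setting $s = 0$ and $\tau = 1$ removes the last two factors and leaves the inverted-beta kernel $(\lambda^2)^{b-1}(1+\lambda^2)^{-(a+b)}$.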

The hypergeometric inverted-beta family contains many well-known
subfamilies of priors for $\kappa$. These include the beta distribution, the
generalized beta distribution [McDonald and Xu (1995)], and the Gauss
hypergeometric distribution [Armero and Bayarri (1994)]. The family is itself
contained in the class of compound confluent hypergeometric distributions
[Gordy (1998)], which has two extra parameters that are not relevant in this
context. These various related families are why we call (A.5) the hypergeometric
inverted-beta prior. The transformed density on the $\kappa$ scale resembles a beta
distribution, and we call this family the hypergeometric-beta (HB) prior.

The family in (A.2) has one major advantage over other similar priors:
there exist easily computable expressions for the posterior mean
$E(\beta_i \mid y_i)$ and the marginal density
$m_1(y_i) = \int N(y_i \mid \beta_i, \sigma^2) \, p(\beta_i) \, d\beta_i$ under the
hypothesis that $\beta_i \neq 0$. We derive these expressions in Appendix B.

A.3. Shrinkage profiles. We now turn to the specification of the four 
hyperparameters, and to the different "local shrinkage profiles" that are 
accessible through different choices of these parameters.

Fig. 3. Implied shrinkage profiles for double-exponential and Cauchy priors.
[Panels: Double-Exponential Prior; Cauchy Prior. Horizontal axis: shrinkage coefficient, kappa.]

All normal scale-mixtures have an implied shrinkage profile $p(\kappa_i)$, which
describes the amount of shrinkage toward the origin that is expected a priori.
The prior's behavior near $\kappa_i = 0$ controls the tail weight of the marginal
prior for $\beta_i$, while the behavior near $\kappa_i = 1$ controls the strength of
shrinkage near zero.

Figure 3 plots the implied shrinkage profiles for two common priors: the 
double-exponential and Cauchy priors. Contrast these shrinkage profiles with 
the wide range of shapes that are accessible through the hypergeometric 
inverted-beta density, some of which are shown in Figure 4. 

One important special case of the hypergeometric inverted-beta family is
the Strawderman prior [Strawderman (1971)], which corresponds to $a = 1/2$,
$b = 1$, $s = 0$, and $\tau = 1$. Another special case is the half-Cauchy prior on the
scale factor $\lambda$, studied by Gelman (2006) and Carvalho, Polson and Scott
(2010). This corresponds to $a = b = 1/2$, $s = 0$, and $\tau = 1$. Yet a third special
case is the uniform-shrinkage prior, where $a = b = 1$, $s = 0$, and $\tau = 1$. All of
these can be seen in the upper-left pane of Figure 4.
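
The half-Cauchy special case can be checked directly. With $\tau = 1$ and $s = 0$, (A.2) is a Beta$(a, b)$ density, and pushing a standard half-Cauchy density on $\lambda$ through $\kappa = 1/(1 + \lambda^2)$ should reproduce Beta$(1/2, 1/2)$. The short sketch below does this pointwise; the function names are ours.

```python
import math

def half_cauchy_pdf(lam):
    """Standard half-Cauchy density on lambda > 0."""
    return 2.0 / (math.pi * (1.0 + lam * lam))

def kappa_density_from_half_cauchy(kappa):
    """Density of kappa = 1 / (1 + lambda^2) induced by a half-Cauchy lambda."""
    lam = math.sqrt((1.0 - kappa) / kappa)
    jacobian = 1.0 / (2.0 * kappa * kappa * lam)   # |d lambda / d kappa|
    return half_cauchy_pdf(lam) * jacobian

def beta_pdf(kappa, a, b):
    """Beta(a, b) density, i.e. HB(a, b, tau = 1, s = 0) on the kappa scale."""
    norm = math.gamma(a) * math.gamma(b) / math.gamma(a + b)
    return kappa ** (a - 1.0) * (1.0 - kappa) ** (b - 1.0) / norm
```

The same `beta_pdf` with $(a, b) = (1/2, 1)$ or $(1, 1)$ gives the Strawderman and uniform-shrinkage profiles, respectively.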

Clearly, (A.2) can lead to many standard-looking shapes that are similar
to other normal scale mixtures. Yet it can also produce a wide variety of 
other densities that are inaccessible through other standard families. We now 
describe the role of each hyperparameter, recalling that more probability 
near k = 1 means more aggressive shrinkage. 

First, $\tau$ is a global scaling factor, with larger values leading to larger
marginal variance in $\beta$. To see this, suppose that all components of $\beta$ have
a common variance component in addition to their idiosyncratic ones:
$(y_i \mid \beta_i) \sim N(\beta_i, \sigma^2)$ and $\beta_i \sim N(0, \sigma^2 \tau^2 \lambda_i^2)$.
The form involving $\tau$ in (A.2) arises from the special case of assuming
a half-Cauchy prior for each $\lambda_i$, as in the horseshoe prior of Carvalho,
Polson and Scott (2010). The generalization of the scaled half-Cauchy prior to
arbitrary $a$, $b$, and $s$ then arises quite naturally on the $\kappa$ scale. Shifting
$\tau$ up and down causes the shrinkage profile to be shifted left and right,
respectively, controlling the overall aggressiveness of shrinkage.

Fig. 4. Effect of changing the four parameters $(a, b, s, \tau)$ on the density for the
shrinkage coefficient $\kappa$.
[Recoverable panel labels: "Special Cases"; "s = 10, tau = 1/4". Horizontal axis: shrinkage coefficient, kappa.]



The parameters $a$ and $b$ are analogous to those of the beta distribution, to
which (A.2) reduces when $\tau = 1$ and $s = 0$. Smaller values of $a$ encourage
heavier tails in $\pi(\beta)$, with $a = 1/2$, for example, yielding Cauchy-like tails.
Smaller values of $b$ encourage $p(\beta)$ to have more mass near the origin, and
eventually to become unbounded; $b = 1/2$ yields, for example,
$p(\beta) \approx \log(1 + 1/\beta^2)$ near 0.

Finally, $s$ is a second global scaling factor, though with a different effect
than $\tau$ on the shape of the density. This parameter has an interpretation as
a "prior sum of squares," with the caveat that it can also be negative. 

The scale parameters $\tau$ and $s$ do not control the behavior of $\pi(\lambda)$ at 0
and $\infty$. Specifically, $\pi(\lambda)$ behaves like $\lambda^{2b-1}$ near the origin, and
like $\lambda^{-(2a+1)}$ in the upper tail. Since $\pi(\beta)$ has the same polynomial
rate of decay as $\pi(\lambda)$, $a$ can be chosen to reflect the desired tail weight
of $\pi(\beta)$.

A.4. The score function and overshrinkage of exceptional observations.

We recall the following theorem from Carvalho, Polson and Scott (2010).

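Theorem A.1. Let $p(|y - \beta|)$ be the likelihood, and suppose that $p(\beta)$
is a mean-zero scale mixture of normals: $(\beta \mid \lambda) \sim N(0, \lambda^2)$, with
$\lambda$ having proper prior $p(\lambda)$. Assume further that the likelihood and
$p(\beta)$ are such that the marginal density $m(y) < \infty$ for all $y$. Define the
following three pseudo-densities, which may be improper:

$$m^\star(y) = \int p(|y - \beta|) \, p^\star(\beta) \, d\beta,$$

$$p^\star(\beta) = \int p(\beta \mid \lambda) \, p^\star(\lambda) \, d\lambda,$$

$$p^\star(\lambda) = \lambda^2 p(\lambda).$$

Then

$$E(\beta \mid y) = -\frac{m^\star(y)}{m(y)} \frac{d}{dy} \log m^\star(y) = -\frac{1}{m(y)} \frac{d}{dy} m^\star(y). \tag{A.6}$$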

Versions of this representation theorem appear in Masreliez (1975),
Polson (1991) and Pericchi and Smith (1992). Theorem A.1 relaxes a specific
regularity condition having to do with the boundedness of $p(\beta)$, and extends
the usual result to situations where $p(\beta)$ is a scale mixture of normals with
proper mixing density and finite marginal $m(y)$.

The theorem characterizes the behavior of an estimator in the presence
of large signals. Specifically, it says that we can achieve "inherent Bayesian
robustness" by choosing a prior for $\beta$ such that the derivative of the log
predictive density is bounded as a function of $y$. Ideally, of course, this bound
should converge to 0 for large $|y|$, and will lead to $E(\beta \mid y) \approx y$ for
large $|y|$. This will avoid the overshrinkage of exceptional observations, clearly
an important goal in large-scale simultaneous testing problems.
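
The representation (A.6) can be verified numerically in a simple case. The sketch below assumes a unit-variance normal likelihood and, purely for illustration, a hypothetical three-point mixing distribution for $\lambda$, so that every integral becomes a finite sum. It compares the direct posterior mean with $-\{1/m(y)\} \, dm^\star(y)/dy$, using a central finite difference for the derivative.

```python
import math

def normal_pdf(y, var):
    """Density of N(0, var) evaluated at y."""
    return math.exp(-0.5 * y * y / var) / math.sqrt(2.0 * math.pi * var)

# Hypothetical discrete mixing distribution for lambda: (value, probability).
MIXING = [(0.5, 0.6), (2.0, 0.3), (10.0, 0.1)]

def marginal(y):
    """m(y) when beta | lambda ~ N(0, lambda^2) and y | beta ~ N(beta, 1)."""
    return sum(w * normal_pdf(y, 1.0 + lam ** 2) for lam, w in MIXING)

def marginal_star(y):
    """m*(y): the same mixture with weights multiplied by lambda^2."""
    return sum(w * lam ** 2 * normal_pdf(y, 1.0 + lam ** 2) for lam, w in MIXING)

def posterior_mean_direct(y):
    """E(beta | y) as a posterior-weighted average of linear shrinkage rules."""
    num = sum(w * normal_pdf(y, 1.0 + lam ** 2) * lam ** 2 / (1.0 + lam ** 2) * y
              for lam, w in MIXING)
    return num / marginal(y)

def posterior_mean_via_score(y, h=1e-5):
    """E(beta | y) = -(1 / m(y)) d/dy m*(y), by central finite difference."""
    derivative = (marginal_star(y + h) - marginal_star(y - h)) / (2.0 * h)
    return -derivative / marginal(y)
```

A finite mixture of normals is not heavy-tailed, so this example illustrates only the identity itself and not the tail robustness discussed above.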

It is easy to verify, using the results of the previous subsection, that
normal scale mixtures with hypergeometric inverted-beta mixing distributions
satisfy the property of tail robustness. This helps to explain their good
performance in high-dimensional settings.

A.5. The effect of shared shrinkage parameters. The hypergeometric
inverted-beta prior allows a combination of global and local shrinkage that
can be both flexible and robust. Figure 5 shows how a very small value of $\tau$,
encouraging strong global shrinkage, can be reinforced by a small observation
($y = 1.0$), and yet be almost completely overruled by a large observation
($y = 4.0$). Meanwhile, the marked bimodality for an intermediate observation
such as $y = 2.5$ reflects uncertainty about whether such an observation
corresponds to signal or noise, with the posterior mean for $\beta$ averaging over
both possibilities.

Fig. 5. The left pane shows the prior for $\kappa$ when $\tau = 1/15$, $s = 0$, and
$a = b = 1/2$, reflecting a prior bias for strong shrinkage. The next three panes
show the different posteriors for $\kappa$ upon observing a single data point:
$y = 1.0$, $y = 2.5$, or $y = 4.0$, respectively.
[Panels: Prior; Posterior, y=1.0; Posterior, y=2.5; Posterior, y=4.0. Horizontal axis: kappa.]
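
The behavior described for Figure 5 can be reproduced approximately by quadrature. The sketch below assumes $a = b = 1/2$, $s = 0$ and $\sigma = 1$, and computes the posterior mean of $\kappa$ for a single observation $y$. The function name is ours, and the substitution $\kappa = \sin^2\theta$ is our own numerical convenience, not a step taken from the paper.

```python
import math

def posterior_mean_kappa(y, tau, n=20000):
    """E(kappa | y) for a = b = 1/2, s = 0, sigma = 1, by midpoint quadrature.

    The substitution kappa = sin(theta)^2 absorbs the endpoint singularities of
    kappa^(-1/2) (1 - kappa)^(-1/2), leaving a smooth integrand on [0, pi/2].
    """
    h = (math.pi / 2.0) / n
    num = den = 0.0
    for i in range(n):
        theta = (i + 0.5) * h
        kappa = math.sin(theta) ** 2
        # Factor of the prior that involves tau, from (A.2).
        prior = 1.0 / (1.0 / tau ** 2 + (1.0 - 1.0 / tau ** 2) * kappa)
        # Likelihood of y given kappa: N(y | 0, 1 / kappa), up to a constant.
        likelihood = math.sqrt(kappa) * math.exp(-0.5 * kappa * y * y)
        weight = prior * likelihood
        num += kappa * weight
        den += weight
    return num / den

means = [posterior_mean_kappa(y, tau=1.0 / 15.0) for y in (1.0, 2.5, 4.0)]
```

With $\tau = 1/15$, the posterior mean of $\kappa$ stays close to 1 for the small observation and drops sharply for the large one, in line with the description above.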

This example demonstrates that global shrinkage through $\tau$ can be very
effective at squelching noise in high-dimensional problems. It is crucial,
however, that $\tau$ be estimated from the data, and that the prior for $\kappa_i$
grow sufficiently fast near 0 in order to allow $\kappa_i$ to escape the strong
"gravitational pull" of a small $\tau$ when $y_i$ is large (as in this example when
$y_i/\sigma = 4$). We recommend setting $a = 1/2$ in sparse problems involving
a normal likelihood; see Carvalho, Polson and Scott (2010) for further discussion.
In situations with heavier-tailed sampling models, it may be appropriate to choose
a smaller value of $a$.

When $1 - 1/\tau^2$ is very close to 1 (or when $1 - \tau^2$ is very close to 1 for
$\tau < 1$), the $\Phi_1$ functions may become slow to evaluate due to the slow
convergence of the series representations given in the Appendix. In our
experience, the issue becomes practically significant in a serial computing
environment only when $\tau^2$ is larger than 1,000 or smaller than 1/1,000.
Additionally, global shrinkage can take place through $s$ rather than $\tau$ (with
$\tau$ being set equal to 1). Then $\kappa_i \sim \mathrm{HB}(a, b, \tau = 1, s)$, and so

$$(\kappa_i \mid y_i) \sim \mathrm{HB}(a + 1/2, b, \tau = 1, s + y_i^2 / 2\sigma^2).$$

Figure 6 shows that global shrinkage through $s$ can produce results quite
similar to global shrinkage through $\tau$.

APPENDIX B: EXPRESSIONS FOR MOMENTS AND MARGINALS 

Throughout this section, we suppress conditioning on nonzero status.
Under our hypergeometric inverted-beta model, the joint distribution for $y_i$
and $\kappa_i$ takes the form

$$p(y_i, \kappa_i) \propto \kappa_i^{a'-1} (1 - \kappa_i)^{b-1} \left\{ \frac{1}{\tau^2} + \left(1 - \frac{1}{\tau^2}\right) \kappa_i \right\}^{-1} e^{-\kappa_i s'},$$

where now $s' = s + y_i^2/(2\sigma^2)$ and $a' = a + 1/2$.

Fig. 6. The left pane shows the prior for $\kappa$ when $\tau = 1$, $a = b = 1/2$, and
$s = -4$. The next three panes show the different posteriors for $\kappa$ upon
observing a single data point: $y = 1.0$, $y = 2.5$, or $y = 4.0$, respectively.

The moment-generating function of (A.2) is easily shown to be

$$M(t) = e^t \, \frac{\Phi_1(b, 1, a + b, s - t, 1 - 1/\tau^2)}{\Phi_1(b, 1, a + b, s, 1 - 1/\tau^2)}.$$

See, for example, Gordy (1998). Expanding $\Phi_1$ as a sum of ${}_1F_1$ functions
and using the differentiation rules given in Chapter 15 of Abramowitz and
Stegun (1964) yields



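$$E(\kappa_i^n \mid y_i, \sigma^2) = \frac{(a')_n}{(a' + b)_n} \, \frac{\Phi_1(b, 1, a' + b + n, s', 1 - 1/\tau^2)}{\Phi_1(b, 1, a' + b, s', 1 - 1/\tau^2)}. \tag{B.1}$$

Using (B.1), we get

$$E(\beta_i \mid y_i) = \left( 1 - \frac{a'}{a' + b} \, \frac{\Phi_1(b, 1, a' + b + 1, s', 1 - 1/\tau^2)}{\Phi_1(b, 1, a' + b, s', 1 - 1/\tau^2)} \right) y_i. \tag{B.2}$$

And by the law of total variance,

$$\mathrm{Var}(\beta_i \mid y_i) = E\{\mathrm{Var}(\beta_i \mid y_i, \kappa_i)\} + \mathrm{Var}\{E(\beta_i \mid y_i, \kappa_i)\} = \sigma^2 \{1 - E(\kappa_i \mid y_i)\} + y_i^2 \, \mathrm{Var}(\kappa_i \mid y_i), \tag{B.3}$$

with all other posterior moments for $\beta_i$ following in turn.

There is also a tractable expression for the marginal likelihood of the data:

$$m(y_i) = \frac{C^{-1}}{\sqrt{2\pi\sigma^2}} \int_0^1 \kappa_i^{a'-1} (1 - \kappa_i)^{b-1} \left\{ \frac{1}{\tau^2} + \left(1 - \frac{1}{\tau^2}\right) \kappa_i \right\}^{-1} e^{-\kappa_i s'} \, d\kappa_i, \tag{B.4}$$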

where again $s' = s + y_i^2/(2\sigma^2)$ and $a' = a + 1/2$. This integral is in the
same family as (A.3), and so by the same series of arguments we obtain

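$$m(y_i) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{y_i^2}{2\sigma^2} \right) \frac{\mathrm{Be}(a', b)}{\mathrm{Be}(a, b)} \, \frac{\Phi_1(b, 1, a' + b, s', 1 - 1/\tau^2)}{\Phi_1(b, 1, a + b, s, 1 - 1/\tau^2)}. \tag{B.5}$$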
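
The expressions for the posterior mean and the marginal density are simple to evaluate when $\tau = 1$, because $\Phi_1(b, 1, c, x, 0)$ then reduces to the Kummer function ${}_1F_1(b; c; x)$. The sketch below makes that simplifying assumption and checks the closed forms against direct quadrature of the kernel $\kappa^{a'-1}(1-\kappa)^{b-1}e^{-\kappa s'}$. The function names are ours, and the quadrature assumes $a, b \geq 1$.

```python
import math

def hyp1f1(b, c, x, terms=300):
    """Kummer series 1F1(b; c; x), which equals Phi_1(b, 1, c, x, 0)."""
    term = total = 1.0
    for m in range(1, terms):
        term *= (b + m - 1) / (c + m - 1) * x / m
        total += term
    return total

def beta_fn(p, q):
    """Beta function Be(p, q)."""
    return math.gamma(p) * math.gamma(q) / math.gamma(p + q)

def posterior_mean_beta(y, a, b, s, sigma2=1.0):
    """E(beta | y) from the closed-form posterior mean, with tau = 1."""
    a1 = a + 0.5
    s1 = s + y * y / (2.0 * sigma2)
    e_kappa = a1 / (a1 + b) * hyp1f1(b, a1 + b + 1.0, s1) / hyp1f1(b, a1 + b, s1)
    return (1.0 - e_kappa) * y

def marginal_density(y, a, b, s, sigma2=1.0):
    """m(y) from the closed-form marginal likelihood, with tau = 1."""
    a1 = a + 0.5
    s1 = s + y * y / (2.0 * sigma2)
    gauss = math.exp(-y * y / (2.0 * sigma2)) / math.sqrt(2.0 * math.pi * sigma2)
    return (gauss * beta_fn(a1, b) / beta_fn(a, b)
            * hyp1f1(b, a1 + b, s1) / hyp1f1(b, a + b, s))

def simpson(f, n=2000):
    """Composite Simpson's rule on [0, 1]."""
    h = 1.0 / n
    total = f(0.0) + f(1.0)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * f(i * h)
    return total * h / 3

def kernel(a, b, s):
    """Unnormalized HB(a, b, tau = 1, s) density on the kappa scale."""
    return lambda k: k ** (a - 1) * (1 - k) ** (b - 1) * math.exp(-s * k)

def posterior_mean_beta_quadrature(y, a, b, s, sigma2=1.0):
    """The same posterior mean, computed by direct integration over kappa."""
    w = kernel(a + 0.5, b, s + y * y / (2.0 * sigma2))
    e_kappa = simpson(lambda k: k * w(k)) / simpson(w)
    return (1.0 - e_kappa) * y
```

For general $\tau$, the two-variable function $\Phi_1$ would be needed in place of `hyp1f1`.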
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